REMARKS ON ERRORS IN FIRST ORDER ITERATIVE PROCESSES WITH FLOATING-POINT COMPUTERS

We consider the iterative process given by

$$ x_{n+1} = x_n + G(x_n) \tag{1} $$

with limit $r$. All quantities are scalar. We suppose the convergence is linear, i.e. there exists $0 \le b < 1$ such that

$$ |x + G(x) - r| \le b\,|x - r| \quad \text{for every } x. \tag{2} $$

Although analogous results can probably be obtained for other types of floating-point arithmetic, we suppose we are dealing with a binary computer with the following properties:

1. All numbers except 0 are of the form $a \cdot 2^{p}$, where $a$ is an exact binary fraction of $N$ bits plus the sign, $p$ is an integer, and $0.5 \le a < 1$.

2. There is a true zero, represented for example by $a = 0$, $p = -P$, where $-P$ is the smallest value of $p$; consequently the smallest non-zero numbers in absolute value are $\pm 2^{-P-1}$.

All these numbers, including zero, will be called "normalized".

Suppose that (1) is realized on this computer under the following assumptions:

1. The value effectively computed instead of $G(x)$ is $\hat G(x)$, with

$$ \hat G(x) = (1 + \eta)\,G(x) + \xi, \qquad |\eta| \le d, \quad |\xi| \le a. \tag{3} $$

$\xi$ and $\eta$ depend on $x$; $d$ and $a$ are independent of $x$. (From here on $a$ denotes this absolute error bound, not the mantissa of property 1.)

2. $\hat G(x)$ and the successive approximations are always represented on the computer as normalized numbers.

The effective process can be written

$$ Y_{n+1} = [\,Y_n + \hat G(Y_n)\,]_R \tag{4} $$

Indeed, by using multiple precision, $Y_n + \hat G(Y_n)$ can be represented exactly, since it is a multiple of $2^{-P-1}$; however, by assumption 2, the mantissa of $Y_{n+1}$ has no more than $N$ digits, and $Y_n + \hat G(Y_n)$ must be rounded, as indicated by $[\ ]_R$.
We concentrate our attention on the rounding procedure in (4). We consider two types of rounding procedures:

1. Normal rounding: $Y_{n+1} = [\,Y_n + \hat G(Y_n)\,]_N$; $Y_{n+1}$ is a normalized number such that $|Y_{n+1} - (Y_n + \hat G(Y_n))|$ is minimal. When two different normalized numbers satisfy this relation, either of them can be chosen as $Y_{n+1}$.

2. Anomalous rounding: $Y_{n+1} = [\,Y_n + \hat G(Y_n)\,]_A$.

   If $\hat G(Y_n) \ge 0$, let
   $Z$ be the smallest normalized number such that $Z \ge Y_n + \hat G(Y_n)$,
   $W$ be the greatest normalized number such that $W \le Y_n + \hat G(Y_n)$.

   If $\hat G(Y_n) \le 0$, let
   $Z$ be the greatest normalized number such that $Z \le Y_n + \hat G(Y_n)$,
   $W$ be the smallest normalized number such that $W \ge Y_n + \hat G(Y_n)$.

   Then
   $[\,Y_n + \hat G(Y_n)\,]_A = W$ if $W \ne Y_n$,
   $[\,Y_n + \hat G(Y_n)\,]_A = Z$ if $W = Y_n$.

   In other words, the sum is rounded toward $Y_n$, unless that would reproduce $Y_n$ itself, in which case it is rounded away from $Y_n$.

The following relations are rather evident:

1. $\bigl|\,(Y_n + \hat G(Y_n)) - [\,Y_n + \hat G(Y_n)\,]_N\,\bigr| \le 2^{-N}\,|Y_n + \hat G(Y_n)|$   (5)

2. $\bigl|\,(Y_n + \hat G(Y_n)) - [\,Y_n + \hat G(Y_n)\,]_A\,\bigr| \le 2^{-N+1}\,|Y_n + \hat G(Y_n)|$   (6)

3. If $Y_n < Y_n + \hat G(Y_n) \le p$, then $Y_n < [\,Y_n + \hat G(Y_n)\,]_A \le p$;   (7)

   if $p \le Y_n + \hat G(Y_n) < Y_n$, then $p \le [\,Y_n + \hat G(Y_n)\,]_A < Y_n$;   (8)

   where $p$ is any number, provided there is a normalized number $s$ such that $Y_n < s \le p$ for (7) and $p \le s < Y_n$ for (8).

Theorem. a) Using the normal rounding, for any $Y_0$ there exists a finite number $M$ such that

$$ |Y_n - r| \le B_N = \frac{2^{-N}|r| + a\,(1 + 2^{-N})}{2 + 2^{-N} - (1+d)(1+b)(1+2^{-N})} \le \frac{2^{-N}|r| + a}{1 - b - 2d - 2^{-N}} $$

for $n > M$.
b) Using the anomalous rounding, for any $Y_0$ there exists a finite number $M$ such that

$$ |Y_n - r| \le B_A = 2^{-N+1}|r| + 2^{-P-1} + \frac{a\,(1 + 2^{-N+1})}{2 - (1+d)(1+b)} $$

for $n > M$.

In both cases, if the bounds $B_N$ or $B_A$ are non-positive, they must be replaced by $+\infty$.
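The two bounds of the theorem can be evaluated exactly with rational arithmetic. The following is a minimal sketch, not part of the original report: the function name `bounds` and the sample value `P = 127` are assumptions chosen for illustration, and a zero denominator is treated like a non-positive bound.

```python
from fractions import Fraction
import math

def bounds(r, a, b, d, N, P):
    """Return (B_N, B_A) from parts a) and b) of the theorem.

    Non-positive bounds (and zero denominators) are replaced by +infinity,
    following the convention stated after the theorem.
    """
    u = Fraction(1, 2**N)                      # 2^-N
    c = (1 + d) * (1 + b)

    def guarded(numerator_terms, denominator):
        if denominator == 0:
            return math.inf
        value = numerator_terms(denominator)
        return value if value > 0 else math.inf

    # B_N = (2^-N |r| + a (1 + 2^-N)) / (2 + 2^-N - (1+d)(1+b)(1+2^-N))
    B_N = guarded(lambda den: (u * abs(r) + a * (1 + u)) / den,
                  2 + u - c * (1 + u))
    # B_A = 2^-N+1 |r| + 2^-P-1 + a (1 + 2^-N+1) / (2 - (1+d)(1+b))
    B_A = guarded(lambda den: 2 * u * abs(r) + Fraction(1, 2**(P + 1))
                  + a * (1 + 2 * u) / den,
                  2 - c)
    return B_N, B_A

# Sample values (hypothetical choice of P; the others are illustrative).
B_N, B_A = bounds(r=Fraction(3, 4), a=Fraction(5, 2**35),
                  b=Fraction(3, 4), d=Fraction(1, 8), N=32, P=127)
```

For these sample values `B_N` is close to $44 \cdot 2^{-32}$ and `B_A` is close to $21.5 \cdot 2^{-32}$, so the bound for the anomalous rounding is roughly half the bound for the normal rounding.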

Truncation errors. Suppose we compute with infinite precision, i.e. without rounding errors.

The remaining inaccuracy of the process will be called the truncation error; it comes from the errors $\eta$ and $\xi$ in equation (3).

We consider the limit of $B_N$ and $B_A$ when $N \to \infty$ and $P \to \infty$:

$$ B = \lim_{N \to \infty} B_N = \lim_{\substack{N \to \infty \\ P \to \infty}} B_A = \frac{a}{2 - (1+d)(1+b)} . $$

Using arguments analogous to those in the proof of the theorem, one can find the following result:

Let, for any $V_0$, the sequence $V_n$ be defined by

$$ V_{n+1} = V_n + \hat G(V_n); $$

then any point of accumulation $V$ of the sequence satisfies the relation $|V - r| \le B$. We give an example where the bound is reached; let

$$ G(x) = -(1+b)(x - r), \qquad \hat G(x) = -(1+d)(1+b)(x - r) - a\,\frac{x - r}{|x - r|} . $$

First we remark that if $a = 0$, the sequence will converge if and only if $|1 - (1+d)(1+b)| < 1$, i.e. $2 - (1+d)(1+b) > 0$, since $d$ and $b$ are non-negative numbers; if the condition is not satisfied, the sequence diverges to infinity.

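For $a \ne 0$, it is easy to verify that if

$$ V_0 = r + \frac{a}{2 - (1+d)(1+b)} $$

then

$$ V_1 = r - \frac{a}{2 - (1+d)(1+b)}, \qquad V_2 = r + \frac{a}{2 - (1+d)(1+b)} = V_0, $$

so that the sequence oscillates between the two values $r \pm a / (2 - (1+d)(1+b))$ and never comes closer to $r$.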
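The oscillation can be checked with exact rational arithmetic. This is a small sketch under assumed sample values ($r = 3/4$, $a = 1/1000$, $b = 1/2$, $d = 1/10$); the helper names `g_hat` and `iterate` are hypothetical and only serve the illustration.

```python
from fractions import Fraction

def g_hat(x, r, a, b, d):
    """Perturbed increment: -(1+d)(1+b)(x - r) - a * sign(x - r)."""
    sign = (x > r) - (x < r)
    return -(1 + d) * (1 + b) * (x - r) - a * sign

def iterate(v0, steps, r, a, b, d):
    """Return [V_0, V_1, ..., V_steps] for V_{n+1} = V_n + g_hat(V_n)."""
    values = [v0]
    for _ in range(steps):
        v = values[-1]
        values.append(v + g_hat(v, r, a, b, d))
    return values

# Assumed sample parameters, chosen so that 2 - (1+d)(1+b) > 0.
r, a, b, d = Fraction(3, 4), Fraction(1, 1000), Fraction(1, 2), Fraction(1, 10)
B = a / (2 - (1 + d) * (1 + b))          # truncation bound
values = iterate(r + B, 4, r, a, b, d)   # starts exactly on the bound
```

Starting from $r + B$, the iterates alternate between $r - B$ and $r + B$, which is the sense in which the bound is reached.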

In order to compare the results of the theorem, i.e. to compare $B_A$ and $B_N$, first suppose that $a = 0$; then

$$ B_A = |r|\,2^{-N+1} + 2^{-P-1}, \qquad B_N = \frac{|r|\,2^{-N}}{2 + 2^{-N} - (1+d)(1+b)(1+2^{-N})} . $$

Since $d \ge 0$ and $b \ge 0$, it follows that $B_N \ge \tfrac{1}{2} B_A - 2^{-P-1}$. $B_A$ is independent of $d$ and $b$ and remains very small; for $d = 0$, $b = 0$, $B_N$ is slightly smaller than $B_A$, but if $b \approx 1$, i.e. when the convergence is very slow, $B_N$ can become very large.

For reasonable values of $b$ and $d$, the increases of the bounds $B_A$ and $B_N$ due to $a \ne 0$ are almost equal (they coincide if one neglects the effects of the rounding, i.e. if $N \to \infty$). Consequently the anomalous rounding can be considered safer than the normal rounding.

Remark. In theorem a, it is asserted that

$$ B_N \le \frac{2^{-N}|r| + a}{1 - b - 2d - 2^{-N}} . $$

This "bound" of the bound $B_N$ is useful when $b \approx 1$.

Example. The bounds $B_A$ or $B_N$ can be reached only in trivial cases. However, for the general case, they remain realistic; that is true for $B_A$, since $B_A$ is not much greater than the truncation error; as for $B_N$, let us consider the following example:

Let $b = 3/4$, $d = 1/8$, $a = 5 \cdot 2^{-35}$, $N = 32$, $r = 3/4$,

$$ G(x) = -\tfrac{7}{4}\,(x - \tfrac{3}{4}), \qquad \hat G(x) = -\tfrac{9}{8} \cdot \tfrac{7}{4}\,(x - \tfrac{3}{4}) - 5 \cdot 2^{-35}\,\operatorname{sign}(x - \tfrac{3}{4}) . $$

Then $B_N \approx 44 \cdot 2^{-32}$, $B_A \approx 21.5 \cdot 2^{-32}$, $B = 20 \cdot 2^{-32}$, and it is easy to check the following computations:

$$ Y_0 = \tfrac{3}{4} + 32 \cdot 2^{-32}, $$

$$ Y_1 = [\,Y_0 + \hat G(Y_0)\,]_N = \tfrac{3}{4} - 32 \cdot 2^{-32}, $$

$$ Y_2 = [\,Y_1 + \hat G(Y_1)\,]_N = \tfrac{3}{4} + 32 \cdot 2^{-32} . $$
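The computations of this example can be replayed with exact rational arithmetic. The sketch below rests on several assumptions that are not in the original: $\hat G$ is evaluated exactly before rounding, all iterates stay in $[1/2, 1)$ so that normalized numbers form a fixed grid of spacing $2^{-32}$, ties in the normal rounding are rounded upward, and the function names are hypothetical.

```python
from fractions import Fraction
import math

U = Fraction(1, 2**32)     # spacing of 32-bit mantissas in [1/2, 1)
R = Fraction(3, 4)         # limit r
A = Fraction(5, 2**35)     # absolute error bound a

def g_hat(x):
    """Perturbed increment of the example."""
    sign = (x > R) - (x < R)
    return -Fraction(9, 8) * Fraction(7, 4) * (x - R) - A * sign

def round_normal(y, v):
    """Nearest grid point to v (ties rounded upward, an arbitrary choice)."""
    return math.floor(v / U + Fraction(1, 2)) * U

def round_anomalous(y, v):
    """Round v toward the previous iterate y, unless that gives y itself."""
    lo = math.floor(v / U) * U
    hi = math.ceil(v / U) * U
    w, z = (lo, hi) if v >= y else (hi, lo)
    return w if w != y else z

def run(y0, rounding, steps):
    """Return [Y_0, ..., Y_steps] for Y_{n+1} = rounding(Y_n + g_hat(Y_n))."""
    ys = [y0]
    for _ in range(steps):
        y = ys[-1]
        ys.append(rounding(y, y + g_hat(y)))
    return ys

normal = run(R + 32 * U, round_normal, 2)
anomalous = run(R + 32 * U, round_anomalous, 50)
```

Under these assumptions the normal rounding reproduces the oscillation of amplitude $32 \cdot 2^{-32}$ shown above, while the anomalous rounding, started from the same $Y_0$, shrinks the error step by step until it stays at distance $20 \cdot 2^{-32}$ from $r$, inside the bound $B_A$.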


Proof of the Theorem

Lemma 1. Let $W_1$ and $W_0$ satisfy the relation

$$ W_1 = (1 + \varepsilon)\,\bigl(W_0 + (1 + \eta)\,G(W_0) + \xi\bigr), $$

where $|\varepsilon| \le e$ = constant and $\eta$, $G(W)$, $\xi$ satisfy the hypotheses given by equations (2) and (3).

Let

$$ K = \frac{e\,|r| + a\,(1 + e)}{2 + e - (1+d)(1+b)(1+e)} ; $$

then:

1. if $|W_0 - r| > K$, then $|W_1 - r| < |W_0 - r|$;

2. if $|W_0 - r| \le K$, then $|W_1 - r| \le K$.

Proof:

$$ W_1 - r = (1 + \varepsilon)\bigl(W_0 + (1 + \eta)\,G(W_0) + \xi\bigr) - r $$

$$ = (1 + \varepsilon)(1 + \eta)\bigl(W_0 + G(W_0) - r\bigr) - \eta\,(1 + \varepsilon)(W_0 - r) + r\,\varepsilon + \xi\,(1 + \varepsilon); $$

by equation (2):

$$ |W_1 - r| \le |(1 + \varepsilon)(1 + \eta)|\;b\,|W_0 - r| + |\eta\,(1 + \varepsilon)|\,|W_0 - r| + |r|\,e + a\,(1 + e) $$

$$ \le |W_0 - r|\,\bigl[(1+e)(1+d)(1+b) - 1 - e\bigr] + |r|\,e + a\,(1 + e). \tag{9} $$

First suppose $|W_0 - r| > K$; by (9):

$$ |W_1 - r| \le |W_0 - r| - \bigl(2 + e - (1+e)(1+d)(1+b)\bigr)\,|W_0 - r| + |r|\,e + a\,(1 + e) $$

$$ < |W_0 - r| - \bigl(2 + e - (1+e)(1+d)(1+b)\bigr)\,K + |r|\,e + a\,(1 + e) = |W_0 - r| \qquad \text{q.e.d.} $$

Now suppose $|W_0 - r| \le K$; by (9):

$$ |W_1 - r| \le K\,\bigl[(1+e)(1+d)(1+b) - 1 - e\bigr] + |r|\,e + a\,(1 + e) $$

$$ = K - K\,\bigl(2 + e - (1+e)(1+d)(1+b)\bigr) + |r|\,e + a\,(1 + e) = K \qquad \text{q.e.d.} $$


Lemma 2.

$$ B_N \le \frac{2^{-N}|r| + a}{1 - b - 2d - 2^{-N}} $$

(if the denominator is $\le 0$, the expression must be replaced by $+\infty$).

Proof.

$$ \frac{2^{-N}|r| + a}{1 - b - 2d - 2^{-N}} = \frac{(2^{-N}|r| + a)(1 + 2^{-N})}{(1 - b - 2d - 2^{-N})(1 + 2^{-N})} $$

$$ = \frac{(2^{-N}|r| + a)(1 + 2^{-N})}{2 + 2^{-N} - (1+b)(1+d)(1+2^{-N}) - d\,(1-b)(1+2^{-N}) - 2^{-2N}} $$

$$ \ge \frac{2^{-N}|r| + (1 + 2^{-N})\,a}{2 + 2^{-N} - (1+b)(1+d)(1+2^{-N})} = B_N \qquad \text{q.e.d.} $$

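Proof of theorem a. By equation (5),

$$ Y_{n+1} = (1 + \varepsilon)\bigl(Y_n + (1 + \eta)\,G(Y_n) + \xi\bigr) \quad \text{with } |\varepsilon| \le 2^{-N}. $$

Replacing $e$ by $2^{-N}$ in Lemma 1, we find

$$ K = \frac{2^{-N}|r| + a\,(1 + 2^{-N})}{2 + 2^{-N} - (1+d)(1+b)(1+2^{-N})} = B_N ; $$

then theorem a and Lemma 1 are equivalent, since there exists only a finite number of normalized numbers: as long as $|Y_n - r| > K$ the distance $|Y_n - r|$ strictly decreases, which can happen only finitely many times, and once $|Y_n - r| \le K$ it remains so.

Lemma 2 completes the proof.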

Proof of theorem b. Since there exists only a finite number of normalized numbers, theorem b is equivalent to the following assertions:

I. If $|Y_0 - r| \le \dfrac{a}{2 - (1+d)(1+b)}$, then $|Y_1 - r| \le B_A$.

II. If $\dfrac{a}{2 - (1+d)(1+b)} < |Y_0 - r| \le B_A$, then $|Y_1 - r| \le B_A$.

III. If $|Y_0 - r| > B_A$, then $|Y_1 - r| < |Y_0 - r|$.

I. By Lemma 1 (applied with $e = 0$),

$$ r - \frac{a}{2 - (1+d)(1+b)} \le Y_0 + \hat G(Y_0) \le r + \frac{a}{2 - (1+d)(1+b)} ; $$

since $2^{-N+1}\,|Y_0 + \hat G(Y_0)| \le \Bigl(|r| + \dfrac{a}{2 - (1+d)(1+b)}\Bigr)\,2^{-N+1}$, by equation (6) we have:

$$ r - \frac{a}{2 - (1+d)(1+b)} - \Bigl(|r| + \frac{a}{2 - (1+d)(1+b)}\Bigr)\,2^{-N+1} \le Y_1 \le r + \frac{a}{2 - (1+d)(1+b)} + \Bigl(|r| + \frac{a}{2 - (1+d)(1+b)}\Bigr)\,2^{-N+1} $$

and consequently

$$ |Y_1 - r| \le |r|\,2^{-N+1} + \frac{a\,(1 + 2^{-N+1})}{2 - (1+d)(1+b)} \le B_A . $$

II. Suppose that $r + \dfrac{a}{2 - (1+d)(1+b)} < Y_0 \le r + B_A$ (the proof is analogous when $r - B_A \le Y_0 < r - \dfrac{a}{2 - (1+d)(1+b)}$). By Lemma 1,

$$ 2r - Y_0 < Y_0 + \hat G(Y_0) < Y_0 , $$

hence

$$ r - B_A \le Y_0 + \hat G(Y_0) < Y_0 ; $$

but $r - B_A \le r - |r|\,2^{-N+1} - 2^{-P-1} < r < Y_0$, and there exists a normalized number $s$ such that $r - |r|\,2^{-N+1} - 2^{-P-1} \le s < r$; we apply equation (8):

$$ r - B_A \le [\,Y_0 + \hat G(Y_0)\,]_A < Y_0 \le r + B_A , \quad \text{i.e. } |Y_1 - r| \le B_A \qquad \text{q.e.d.} $$

III. Suppose $Y_0 > r + B_A$ (the proof is analogous when $Y_0 < r - B_A$). By Lemma 1:

$$ 2r - Y_0 < Y_0 + \hat G(Y_0) < Y_0 ; $$

but $2r - Y_0 < r - 2^{-N+1}|r| - 2^{-P-1} < r < Y_0$, and there exists a normalized number $s$ such that $r - |r|\,2^{-N+1} - 2^{-P-1} \le s < r$; we apply equation (8):

$$ 2r - Y_0 < [\,Y_0 + \hat G(Y_0)\,]_A < Y_0 , \quad \text{i.e. } |Y_1 - r| < |Y_0 - r| \qquad \text{q.e.d.} $$


